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Many proteins have evolved to form specific molecular complexes and 
the specificity of this interaction is essential for their function. The net- 
work of the necessary inter-residue contacts must consequently constrain 
the protein sequences to some extent. In other words, the sequence of an 
interacting protein must reflect the consequence of this process of adap- 
tation. It is reasonable to assume that the sequence changes accumulated 
during the evolution of one of the interacting proteins must be compen- 
sated by changes in the other. 

Here we apply a method for detecting correlated changes in multiple 
sequence alignments to a set of interacting protein domains and show 
that positions where changes occur in a correlated fashion in the two 
interacting molecules tend to be dose to the protein-protein interfaces. 
This leads to the possibility of developing a method for predicting con- 
tacting pairs of residues from the sequence alone. Such a method would 
not need the knowledge of the structure of the interacting proteins, and 
hence would be both radically different and more widely applicable than 
traditional docking methods. 

We indeed demonstrate here that the information about correlated 
sequence changes is sufficient to single out the right inter-domain dock- 
ing solution amongst many wrong alternatives of two-domain proteins. 
The same approach is also used here in one case (haemoglobin) where 
we attempt to predict the interface of two different proteins rather than 
two protein domains. Finally, we report here a prediction about the 
inter-domain contact regions of the heat- shock protein Hsc70 based only 
on sequence information. 
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Introduction 



The protein-protein Interaction problem 

Molecular recognition is a key process in biologi- 
cal systems. The order and control of protein-pro- 
tein interactions in signalling pathways and 
metabolic networks are important aspects of mol- 
ecular biology and biochemistry. DNA replication 
and transcription, RNA splicing, protein sorting, 
cell adhesion, signalling cascades and metabolic 
cycles are just some examples of the many complex 
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processes dominated by protein-protein recog- 
nition. 

The unravelling of this complex process requires 
two major steps. First, it is necessary to find the 
interacting proteins in the cell soup (e.g. the long 
quest for the downstream effectors in the ras-p21 
signalling cascade); and second, to describe at the 
molecular level how the interaction takes place, 
eg. how and where ras-p21 interacts with the raf- 
kinase and how conformational changes related 
with GTP hydrolysis control the interaction 
between the proteins. 

The first issue of searching for the interacting 
components in a functional complex is a daily pro- 
blem for experimental biology but has remained so 
far untouched from the theoretical point of view. 
The problem of describing and predicting the mol- 
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ecular complexes in detail also known as the 
"docking problem" has instead attracted a great 
deal of attention and has led to the development of 
several different theoretical methods. 



Current physical approaches to the 
docking problem 

Docking has attracted much attention (for recent 
reviews, see Lengauer & Rarey, 1996; Strynadka 
et al., 1996). Undoubtedly, progress has been made, 
and some methods are ready for challenges such 
as the prediction of the interaction between lacta- 
mase and one of its inhibitors. It was accomplished 
quite successfully by six different groups 
(Strynadka et al, 1996), or the more recent CASP-2 
meeting (WWW: http://iris4.carb .nistgov/casp2). 
All current docking methods require the three- 
dimensional structures of the interacting proteins 
to be known. In all methods, the interacting 
surfaces are described by different physical proper- 
ties (Connolly surfaces, grids, protein slices, 
property vectors, etc.) to allow the identification of 
geometrically complementary regions between the 
two proteins (Cherfils et aL, 1991; Fischer et al, 
1995; Helmer-Citterich & Tramontane 1994; 
Jackson & Sternberg, 1995; Jiang & Kim, 1991; 
Shoichet & Kuntz, 1991; Stoddard & Koshland, 
1992; Walls & Sternberg, 1992; Katchalski-Katzir 
efaJ.,1992). 

Most algorithms treat proteins as rigid bodies 
and in only a few cases is protein flexibility 
taken into account (Totrov & Abagyan, 1994; 
ODonoghue, S. & Nilges, M., personal communi- 
cation). Flexibility is an inherent difficulty in the 
docking problem, since most inter-protein com- 
plexes undergo induced-fit movements upon bind- 
ing, and hence a rigid-body description of the 
individual components may not be accurate en- 
ough to predict the structure of the final complex. 

Docking methods and the characteristics of 
protein Interfaces 

It is generally accepted that the physical prin- 
ciples underlying protein folding and protein-pro- 
tein association are similar. This belief is supported 
by detailed studies of similarities in the packing of 
protein interfaces and protein interiors (Walls & 
Sternberg, 1992), and similarities in the overall 
resemblance of the hydrophobicity patterns (Young 
et al., 1994). However, our understanding of the 
peculiar characteristics of protein-protein inter- 
action is still very limited. Earlier attempts to study 
complementary surfaces between proteins (Argos, 
1988; Janin & Chothia, 1990; Janin et oL, 1988) were 
hampered by the lack of experimental data. More 
recent studies (Jones & Thornton, 1996; Tsai et al, 
1996) are for the first time providing tools for a 
systematic approach to the characterisation of pro- 
tein-protein interfaces by rigorous scanning of 
data bases of protein complexes. 



The study of the evolution of oligomerisation 
has also become an important issue and the first 
ideas about the origin of the adaptation in protein 
complexes from the initial components are emer- 
ging (Fletterick & Bazan, 1995; Bennett et al., 1995). 

A new approach to predict protein-protein 
contact regions based on sequence Information 

We propose here a new and completely different 
approach to the study and prediction of protein- 
protein interaction. Instead of considering the 
structural nature of the interactions, we try to de- 
tect the sequence traces that evolution may have 
left on the interacting sequences during the process 
of preserving the protein-protein interaction sites. 
Therefore, our approach is not restricted to the 
cases in which die structures of the proteins to be 
docked are known and is applicable to any family 
of interacting proteins for which a large enough se- 
quence family is available. 

Sequence Information and the process of 
protein- protein co-adaptation 

There is a common agreement among research- 
ers that interacting proteins undergo a process of 
co-evolution. "Over time, amino acid substitution 
may stabilise an interface that does not exist in the 
closed monomer . . . stabilising mutations in these 
interfaces would be favoured in natural selection" 
(Bennett et al, 1995); however, no explicit strategy 
has been proposed for detecting the traces of this 
process from protein sequences. We propose that it 
is possible to detect this signal by studying com- 
pensatory mutations. In order to do so, we have 
appropriately modified our previously published 
method for the calculation of correlated mutations 
in multiple-sequence alignments (Gdbel et al., 1994; 
Pazosrt a/., 1997). 

Defining correlated mutations 

Several groups have studied correlated 
mutations: technical differences between different 
approaches have led to conflicting conclusions 
about the nature and intensity of this phenomena 
(Altschuh et al, 1987, 1988; Gdbel et al., 1994; 
Neher, 1994; Shindyalov et al., 1994; Taylor & 
Hatrick, 1994). Thus, it is important to define 
exactly our notion of correlated mutations. In this 
and previous work, we have used the term corre- 
lated mutations to indicate a tendency of positions 
in proteins to mutate co-ordinately. We measure 
this tendency by analysing the correlation between 
changes in pairs of positions in multiple sequence 
alignments, with an unambiguous definition of 
correlation (see Methods). 

Biological meaning f compensatory mutaH ns 

There is dear experimental support for the role 
of compensating mutations in protein stability 
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(Serrano et al, 1990) and functi n (Gregoret & 
Sauer, 1993). Vernet et al. (1992) directly tested the 
influence in protein stability of some pairs of corre- 
lated mutations. We believe that the signal 
detected by our method corresponds mainly to 
networks of positions that have undergone 
compensating mutations during evolution. If inter- 
actions between proteins are of the same physical 
nature as intra -protein interactions, then their con- 
sequences at the sequence level are most likely also 
similar. Therefore, we apply our method, which 
we have previously used to predict contacts in 
globular proteins (Gobel et al., 1994; Pazos et al, 
1997), to the problem of predicting interactions 
between proteins. As we will show here, the signal 
at the sequence level for inter-protein contacts 
turns out to be even more specific than that for 
intra-protein contacts, possibly because it is subject 
to a stronger selective pressure. 



Testing the method 

The purpose of this work is to test the feasibility 
of a sequence-based approach to the prediction of 
interacting regions in protein complexes. To do so, 
we first show that correlated pairs between two 
different proteins are significantly closer to each 
other than other pairs of positions in the same pro- 
teins, and second that they can be used to discrimi- 
nate the correct docking solutions among many 
alternative wrong ones in proteins of known struc- 
ture. We then carry out a bona fide prediction of the 
yet unknown interaction site between the two 
domains of the Hsc70 heat shock protein 



Results 

The results are presented in the following order 
first we demonstrate that correlated mutations do 
contain information about inter-domain contacts. 
We tested our method mainly for inter-domain 
interactions to take advantage of the larger set of 
examples of proteins of known structure for which 
many homologous sequences are available. As 
described in detail later, we demonstrate that, on 
average, pairs of residues detected as "correlated" 
by our method are closer to each other than the 
average pairs of residues in the same protein. 

Next we show how the correlated mutation anal- 
ysis can be used to identify docking solutions very 
close to the native solution from many wrong sol- 
utions. The aim of this experiment is to empirically 
evaluate how much information about inter-do- 
main contacts is contained in correlated mutations. 
It is important to remember that we are not 
attempting to replace existing docking methods 
based on structural information; we only want to 
establish that correlated mutations are good indi- 
cat rs of contacting residues. Our results should 
not be compared to any current docking method. 



The docking algorithm here is used only as a rapid 
tool to generate many alternative "reasonable" sol- 
utions. In a first set of experiments, we generated 
thousands of alternative solutions that fully cover 
the space of possible solutions without any attempt 
to increase the number of solutions close to the real 
docking position. The second set of experiments 
corresponds to a more demanding test, since in 
this case the set of solutions among which we 
want to discriminate consists of hundreds of physi- 
cally realistic solutions, corresponding to the best 
scoring complementary surfaces calculated with a 
standard docking program. 

The third section contains the results obtained 
for the complex between a and p haemoglobin, a 
test selected to prove that our method has similar 
performances when applied to inter-protein as well 
as inter-domain contacts. A more complete test of 
protein-protein complexes is prevented by the 
very limited availability of data on protein-protein 
complexes where both the structure and a suffi- 
cient number of aligned sequences for the same 
species (see later) are known. 

Finally, a bona fide prediction is reported. 
Sequence information is used to predict die con- 
tacting residues between the two domains of the 
heat-shock protein Hsc70. This prediction is used 
to illustrate the novel feature of our approach: that 
prediction can be made in the absence of structural 
information. The example is also biologically rel- 
evant the function of Hsc70 is based on the inter- 
action between its two domains. The N-terminal 
(Nt) domain contains the ATP-binding site, while 
the C-terminal (Ct) domain is mainly responsible 
for peptide binding (Chappel et al., 1987; Gragerov 
et al., 1994; Montgomery et al., 1993). The inter- 
action between the two domains generates the bio- 
logical functions of peptide binding and release 
(McCarty et al., 1995). The structure of both iso- 
lated domains (Nt domain of hsc70; Flaherty et al., 
1990; Ct domain of its related protein DnaK, Zhu 
et al., 1996) has been solved, although the Ct 
domain structure is not yet publicly available. This 
is an appropriate moment for a bona fide prediction, 
since the structure of the complex has not yet been 
solved and "classical" docking methods cannot be 
used until both structures are available. 

Prediction of domain-domain contacts for 
different protein families 

We have previously shown that in single-do- 
main proteins correlated residues tend to be closer 
than other residues (Gobel et al., 1994). This general 
result is illustrated in Figure 1(a) for papain (9pap, 
Kamphuis et al., 1984). In the Figure we compare 
the distribution of the distances between pairs of 
correlated residues with that of the distances 
between all pairs of residues in each f the 
domains of papain. The distribution shows a clear 
shift of the population of correlated positions toward 
closer distances. This example supports our earlier 
conclusion that, in globular proteins, correlated 
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Figure 1, Bar diagrams comparing the proportions of pairs of residues at different distances. Distributions are rep- 
resented for all residues (filled bars) and for correlated pairs of residues (hatched bars) in papain (9pap). (a) Distances 
between pairs in the two independent domains, and (b) distances between the two domains. Correlated positions are 
shifted toward smaller distances. 



positions are statistically closer than non-correlated 
positions. 

The spatial proximity of correlated positions 
inside globular proteins can be extended to the 
proximity of correlated positions belonging to two 
different domains. The actual values of inter-do- 



main distances for all residues and for correlated 
positions belonging to two different domains for 
papain are compared in Figure 1(b). Once again a 
clear shift of the correlated pairs toward smaller 
distances is observed. This shift indicates that cor- 
related positions have a tendency to be closer to 



Table 1. The 21 monomeric protein families with two domains 



PDB 
code* 


Size 6 


Domain 
Sire' 


Domain definition* 1 


HSSP 
NAUGN* 


1-1' 


Xd 
1/2 

n-n« 


Mr* 1 


Reference 1 


A. Disjoin domains 












-2^2 


(2) 


4mt2 


61 


29/32 


1-29/30-61 


72 


Nep 


236 


3d£r 


162 


80/68 


2-32, 112-160/38-105 


18 


11,68 


7,92 


-1,63 


(3) 


4tnc 


162 


88/72 


3-90/91-162 


139 


439 


10,16 


0,45 


(2) 
(3) 


3dn 


148 


74/66 


5-78/82-147 


167 


237 


8,61 


1,63 


1dm 


148 


85/59 


4-88/89-147 


175 


138 


1031 


2,19 


(1) 


B Conjoin damsons 










336 


-0,77 


(1) 


lmd 


124 


73/51 


1-4930-103/50-79, 104-124 


62 


7,57 


4tms 


316 


223/93 


1-52,146-316/53-145 


20 


439 


Ne 


130 


(1) 




416 


199/216 


1 -188,405-415/189-404 


42 


5,07 


1,95 


236 


(1) 


C Interacting domains 










16,71 


031 


(3) 


2gcr 


173 


82/92 


1-82/83-174 


45 


5,22 


laic 


123 


67/55 


38-104/1-37,105-122 


67 


2,40 


426 


233 


(2) 


3blm 


257 


156/86 


1-67,168-256/69-154 


33 


2,94 


9,66 


322 


(3) 


2pf2 


156 


62/83 


1-62/63-145 


20 


239 


5,91 


435 


(1) 


2bbm 


148 


74/66 


5-78/82-147 


177 


4,42 


934 


431 


(3) 


lppl 


323 


212/111 


1-192304-323/193-303 


30 


439 


-U4 


433 


(1) 


test 


240 


126/103 


16-29,122-233/30-12034-245 


211 


IB35 


1837 


527 


(1) 


3adk 


195 


144/50 


1-37,88-194/38-87 


34 


13,70 


-637 


622 


(1) 


3rp2 


224 


109/108 


16-21,128-230/28-122231-243 


181 


1230 


1337 


7,10 


(3) 


9pap 


212 


112/100 


l-16,113-208/17-U23»-212 


68 


1832 


7,98 


7,12 


(1) 


2c2c 


112 


50/62 


1-33,96-112/34-95 


117 


8,60 


8,44 


835 


(2) 


3txx 


105 


72/35 


1-72/74-108 


45 


3,91 


8,46 


933 


0) 


lsgt 


223 


139/80 


16-28,69-80,121-234/29-68,81-120 


158 


1937 


15,00 


1220 


a) 



Nep, Not enough pairs for Xd calculation. 

* PDB and chain identifiers. 

k Chain length (in amino acid residues). 

c Length of each domain (in amino acid residues). 

d Domain definitions (DI/DH). 

* Number of sequences in the multiple sequence alignment From the HSSP Data Base (Sander & Schneider, 1993). 

*Xa% the harmonic weighted difference between the binned distributions of distances between all residues and correlated pairs of 
positions in the first domain (see Methods). 

* As 1 but for residue pairs in the second domain. 

h As ' but far pairs of residues belonging to different domains. 

■The domain definition was taken from: (1) Holm & Sander (1994); (2) Siddiqui & Barton (1995); (3) Sowdhazruni & Blurtdell 
(1994); (4) Swindells (1994). 
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the inter-domain interface. In order to quantify the 
difference between die population of distances 
between correlated pairs and that of all other pairs, 
we define the parameter Xd, as the harmonic 
difference between the two binned populations 
(see Methods). 

Table 1 shows the results for a set of 21 proteins 
with two domains. In almost all cases, the popu- 
lation of correlated pairs inside the individual 
domains (domain I or domain II) is shifted toward 
smaller distances as indicated by large positive Xd 
values. There are four exceptions with negative or 
very small Xd values. We suspect that this may be 
due to imprecise definitions of domain boundaries 
leading to non-perfectly globular proteins, with 
distorted residue distance distributions. 

Table 1 shows that correlated positions between 
the two protein domains are, on average, closer 
than non-correlated positions, with positive Xd 
values for 17 out of the 21 cases. In this Table, 
proteins are divided into three categories according 
to their degree of interaction, from weakly to 
strongly interacting domains (disjoin, conjoin and 
interacting, following Sowdharnini & Blundell 
(1994)). Correlated positions are closer only when 
there is a real association between domains, as 
demonstrated by higher Xd values for interacting 
domains than for any of the other two categories. 

One particular example may illustrate the differ- 
ences found in the analysis of domains with differ- 
ent degree of the interaction- In two crystal forms 
of calmodulin (FDB code 3cln (Babu et d., 1988) 
and 1dm (Rao et al., 1993)) the two protein 
domains do not interact, die closest residues 
belonging to different domains are more than 24 A 
apart. When using this structure to calculate the 
distance distributions, correlated mutations cannot 
be much closer than the rest of the residues, and in 
fact the Xd value is only 2.19 (Table 1 and 
Figure 2(a)). These crystal structures represent only 
one of two alternative forms of calmodulin. A 
different form is observed (2bbm, Dcura et al, 1992) 
where the two domains embrace a bound peptide 
substrate. When this "closed" form of calmodulin 
is used to calculate distances, correlated residues 
are closer than non-correlated ones with an Xd 
value of 4.53 (Table 1 and Figure 2(b)). This value 
is in good agreement with that measured for 
weakly-interacting domains of other proteins. In 
other words, in calmodulin, correlated mutations 
contain information about inter-domain contacts, 
and these are related to the interactions observed 
in the closed structure. 

The observation that correlated positions tend to 
be closer to domain interfaces only in truly inter- 
acting domains could be i nterpret ed as an indi- 
cation of the physical nature of correlated positions 
as compensatory mutations, since compensation 
can occur only when physical contact between 
interacting residues has led them to co-evolve. 



sdn 



1-a 




Figure 2. Bar diagrams comparing the proportions of 
pairs of residues at different distances in two confor- 
mational states of calmodulin. The distribution of dis- 
tances between all pairs of residues (filled bars) and 
correlated pairs (hatched bars) are compared for the 
open ((a) 3cln) and closed ((b) 2bbm) forms of calmodu- 
lin. There is no interaction between the two domains in 
the open form (disjoin according to Sowdharnini & 
Blundell, 1994) and a moderate interaction in the dosed 
form (classified as interacting). The population of corre- 
lated mutations is shifted more obviously toward 
shorter distances when the closed form is used to calcu- 
late the distances. We obtained Xd values of 4.31 and 
2.19 for the closed and open form, respectively. Notice 
in the particular example of (b) there are not correlated 
pairs at distances between 24 and 28 A. 



The Information about correlated positions 
may be sufficient for selecting the correct 
Inter-domain docking solutions among many 
alternative possibilities 

We generated a large number (7440) of random 
solutions for the docking of two interacting pro- 
tein domains (Figure 3) in order to cover as 
thoroughly and evenly as possible the protein 
surface with physically realistic solutions, and 
tried to singl out th correct orientation using 
correlated mutations. The right docking solutions 
are clearly distinguished from the random 
alternatives (Figure 3). 
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Figure 3. Distribution of the Xd values obtained for 
7440 docking solutions covering the full docking space 
for the two domains of cytochrome c and trypsin. The 
x-axis, harmonical difference (Xd) between the distri- 
butions of relative distances between all pairs of resi- 
dues and correlated pairs of positions (see Methods). 
The y-axis is the number of docking solutions corre- 
sponding to each Xd value, (a) 2c2c, cytochrome, (b) 
3est trypsin. The different solutions cover all the poss- 
ible range of interactions between the two domains and 
they have been selected to be physically realistic in sur- 
face complementarity. 



Because of the nature of our calculation, we are 
able to detect the solutions that best fit the 
observed pattern of correlation. This implies that, 
even if the right solution was not present in our 
set, the solutions closer to it would be identifiable. 
In fact, correlated positions tend to be found closer 
in space also in ncin-optirrial solutions whose dock- 
ing orientation is similar to the correct one. 

To illustrate the first point, we compared the Xd 
values for the best solutions (lower RMS from the 
correct orientation) with the rest of the distribution. 
Even if these solutions are relatively far apart from 
the real solution (average RMS values of 10.02 and 
14.07 A) they have high Xd values (averages of 
3.06 and 2.11). With these Xd values, they can be 
clearly distinguished from the bulk of the random 
solutions. Conversely, the ten solutions having a 
higher Xd value hav average RMS values of 27.24 



and 20.97, clearly smaller than the average values 
for all other random s lutions (34.88 and 45.44 A, 
respectively). 

We also designed a second more demanding ex- 
periment. Instead of random docking solutions we 
chose the best solutions generated with a standard 
docking program based on surface complementar- 
ity (ESCHER, see Methods). Each one of these sol- 
utions is an alternative to the real solution and 
corresponds to local optima of surface complemen- 
tarity (see Methods). 

The results for all the proteins with strong 
inter-dornain interaction in Table 1 (labelled as 
interacting domains) are given in Figure 4. First, 
there is a general correlation between RMS and 
discrimination by correlated mutations (Xd). With 
two exceptions (2gcr and 3adk), correct docking 
solutions with low RMS values have larger Xd 
values than other docking solutions with larger 
RMS values. 

Results are quantified in Table 2. Correct dock- 
ing solutions are always among the best 15% of all 
solutions, with the two exceptions mentioned 
above. We define as correct those solutions with an 
RMS lower than 5 A from the experimental struc- 
ture, since they represent only small differences in 
rotation of the two domains; this can be considered 
as the limits of resolution of the sequence-based 
method. Using this more relaxed criteria, then at 
least one right docking solution is found among 
the 8% best Xd values in 11 of the 14 cases. 

For three examples, sequence information does 
not seem to be sufficient to disoirninate among 
alternative docking solutions; the two mentioned 
before (2gcr and 3adk) and 2pf2 (where the dock- 
ing method did not find any valid alternative sol- 
ution close to the experimental structure). After 
visual inspection our interpretation is that highly 
symmetrical or elongated complexes are difficult 
for our method. It is imaginable that in a very 
elongated thin protein alternative dockmg in two 
opposed faces will give very high RMS values, 
while the distance between correlated pairs may be 
very similar in both of them. 

Figure 5(a) shows some of the solutions for pa- 
pain (9pap). The solution at 5 A RMS is very close 
to the real solution for any biological application 
A solution of high RMS shows how correlated 
pairs are clearly sitting in different faces of the pro- 
tein. The structure of a wrong docking solution 
that scores better than the real solution shows how 
correlated residues are marginally closer than in 
the right orientation due to the particular shape of 
the region in which most of the correlated residues 
are found. 

A further step: correlation between two 
different proteins 

The same idea, applied abov to protein do- 
mains, can be used to predict the interaction be- 
tween different proteins. There is a practical 
difficulty in testing this idea: currently there are 
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Figure 4. Scatter plot of the values of Xd against RMS. 
The y-axis, harmonic difference (Xd) between the distri- 
butions of relative distances between all residues and 
correlated pairs of positions (see Methods). The x-axis is 
the RMS between the real structure and the different 
alternative docking positions, (a) lsgt, trypsin; (b) 2c2c, 
cytochrome; (c) 9pap, papain; (d) 3rp2, rat mast cell pro- 



%• 


%(5Ay» 


53.49 


53.49 


15.07 


7.95 


13.90 


13.90 


13.76 


13.76 


6.25 


3.08 


2.90 


2.90 


9.76 


1.22 


28.73 


28.48 


253 


0.00 


3.08 


3.08 


7.69 


0.00 


7.75 


2.46 


0.00 


0.00 



Table 2. Percentage of docking solutions that score bet- 
ter than the correct one for 11 proteins with interacting 
domains 

PDBld 

2gcr 
laic 
3blm 
2p£2 
2bbm 
tppl 
3est 
3adk 
3rp2 
9pap 
2c2c 
3trx 

lsgt 

• Percentage of solutions scored better than the real one. 
b The same considering all the solutions in the range 0 to 5 A 
RMS as correct 



few cases where the three-dimensional structure of 
the protein complex and many corresponding se- 
quences from different species are available. 

We have chosen to test the concept using haemo- 
globin, since the wealth of sequences for this pro- 
tein made it possible to select an appropriate 
subset of sequences. We selected die otl-(J2 mono- 
mers of the tetramer because they contain the func- 
tional interface that undergoes structural changes 
between the oxy and deoxy forms (Jayaraman et al, 
1995). We generated multiple docking solutions for 
the otl-p2 dimer and used, as in previous cases, 
the information about correlated mutations to 
discTirrunate among them. As can be seen in 
Figure 5(b), there is a clear difference between 
good solutions and wrong ones. Xd values can be 
used to disairninate between them: the right sol- 
ution always scores among the first 6% and the Xd 
value is 3.36, in the range of the examples of inter- 
acting domains shown in Table 1. This result is sat- 
isfactory, especially in view of the feet that this is a 
difficult case with a very small interacting surface 
(Lesk & Chothia, 1980; Perutz, 1978). 

Prediction of contact between domains in the 
absence of three-dimensional structures 

To illustrate the predictive value of our method 
we present a prediction of the domain interaction 



tease; (e) 3est, elastase; (f) 2bbm, calmodulin bound to 
substrate; (g) 2p£2, prothrombin; (h) laic, a-lactoalbu- 
min; (i) 2gcr, y-crystallin; (j) 3blm, (^lactamase; (k) lppl 
penidllopepsin; (1) 3adk, adenylate kinase; (m) 3trx, 
thioredoxm. The different examples cover all the range 
of interacting domains. The right solution is always 
selected among the 8% best docking solutions, except 
for 2gcr, 3blm, 2p£2 and 3adk (see Table 2). Solutions 
with RMS smaller than 5 A can be considered as valid 
solutions for the level of resolutions expected from a 
method based only on sequences (vertical thick line on 
the plots). 
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hbb 




Figure 5. Scatter plot of the values of Xd against RMS for different docking solutions of (a) 9pap and (b) al-02 of hae- 
moglobin (lhbb, Kavanaugh et al., 1992). The x-axis, RMS between the real structure and the different alternative 
docking positions. The y-axis, harmonica] difference (Xd) between the distributions of relative distances of all residues 
and correlated pairs. The right docking solution is among the 5% best scored ones. In the case of haemoglobin it is 
difficult to generate alternative docking solutions close to the real one, since the surface of interaction of the mono- 
mers is sharp and small A ribbon representation of some docking solutions including the real one is shown. The resi- 
dues participating in the pairs with higher correlation value are highlighted. 
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Figure 5. Scatter plot of the value-; of X« a gainst RMS foT different docking solutions of (a; 9pap and (b) al»p2 of hae- 
moglobin (Ihbb, Kavanaiigh «/.. 1992). The .r-axis, RMS between the real structure and the different alternative 
docking positions. The y-axLs, harmonica! differ*>nce (Xd) between the distributions of relative distances of ail residues 
and correlated pairs. The right docking solution is among the 5% best scored ones. In the case of haemoglobin it is 
difficult to generate alternative docking solutions close to the real one, sine; the surface of interaction of the mor.o- 
n^.erv \& sharp and small. A ribbon representation of some docking solutions including the real one is shown. The resi- 
dues participating in the pairs with higher correlation value are highlighted. 
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Figure 6. Predicted contacts between the Nt and Ct 
domains of Hsc70. In the upper part the three-dimen- 
sional structure of the Nt domain (3hsc, Flaherty et al., 
1990) is shown as a ribbon plot. A schematic view of 
the NMR secondary structure assignment of the Ct 
domain (Morshauser et al., 1995) is shown in the lower 
panel. Strands are represented by arrows, helices by 
boxes. The residues undergoing correlated mutations 
between domains are shown as sticks in the ribbon 
plot. The ten best correlations are shaded and con- 
nected by lines. Their residue number and code are 
also given. Residues participating in the ten next best 
correlations are also shown as light sticks on the rib- 
bon plot. Additionally, correlations between residues in 
the Ct domain are represented as broken lines. The 
Figure points to a dear docking solution between the 
two p sheets of the Ct domain and between the first P 
sheet of the Ct domain and the back of the Nt domains. 



in the heat-shock protein Hsc70. The interaction 
between the Nt and Ct domains is essential for the 
function of the protein. The three-dimensional 
structure is known only for the Nt domain 
(Flaherty et al., 1990: ribbon plot in Figure 6). The 
structure of th Ct domain has been solved 
recently for DnaK, a related protein (Zhu et al., 
1996), but it is not yet publicly available. A cartoon 
depicting the secondary and super-secondary 
structure of this protein as assigned by Morshauser 
et al. (1995) is shown in Figure 6. 



The correlated mutations that we identified for 
this case clearly predict that two denned regions 
should be part of the interacting surface. Both 
regions map in the front face of the Nt domain 
(with respect to the standard view of Figure 6). 
This information could be directly tested by 
mutagenesis experiments and will be ultimately 
validated by the experimental determination of the 
complex between the Nt and Ct domains. 



Discussion 

The co-evolution of a protein-protein complex 
in different organisms must leave visible traces at 
the sequence level. Part of this information can be 
captured as correlated positions in multiple 
sequence alignments. We have previously shown 
that there is indeed a trend for correlated pairs of 
residues to be closer in space than non-correlated 
pairs of residues in single-domain proteins (Gobei 
et al., 1994). 

Here we verify that this behaviour is character- 
istic for residues in single domains (intra-protein 
contacts) and for residues sitting in two different 
protein domains (inter-protein). The information 
contained in our definition of correlated mutations 
is able to discriminate between the real docking 
solution and many other realistic but wrong 
alternatives in a significant number of cases. We 
have tested the method for two-domain proteins, 
for which it was possible to obtain a collection of 
examples. We anticipate that the same results hold 
for interacting proteins. Indeed, in the case of hae- 
moglobin, correlated positions between the a and P 
chains are sufficient to single out the right orien- 
tation of the two domains among many alterna- 
tives. 

We evaluated the performances of the method 
on known cases, and then used it to carry out a 
real prediction for the inter-domain interaction of 
the heat-shock protein HscTO. This example illus- 
trates the potential of the method to generate 
specific predictions about contacting residues and 
regions even when the protein structure is 
unknown. 



Limitations of the method 

The ability of correlated positions to discriminate 
between correct and incorrect relative positions of 
two domains is clearly related to the degree of 
physical proximity between the domains. Our 
interpretation is that only the co-evolution of 
closely interacting residues leaves detectable 
signatures at the sequence level. 

The case of calmodulin is instructive, since corre- 
lated positions properly describe the domain con- 
tacts in the closed form of the protein (2bbm) 
without being biased by the existence of an open 
form. Therefore, predictions should be carried out 
only for interacting proteins, as is the case for 
Hsc70. 
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It is important to note that other factors influ- 
ence the quality of our predictions: the quality of 
the alignment the distribution of sequences in the 
family and the family size. As a rule of thumb, pre- 
dictions are reliable only in families with more 
than 15 sequences. These sequences have to be 
well distributed, with both distant and close horn- 
ologues. Since correlation is based on the co-adap- 
tation of proteins, the analysis requires the 
alignment of co-evolved proteins. Although these 
data are not yet available for many protein 
families, the current pace of the different sequen- 
cing projects suggests that mis limitation will be 
overcome very soon. 

A further limitation of the method is that it is 
unreliable when applied to homo-multimers 
because it is impossible to distinguish between sig- 
nals coming from inter or intramonomer contacts. 
As with NMR studies on homo-multimers, this 
problem can in principle be solved (O'Donoghue 
etal, 1996). 

Future prospects 

Methods that use the three-dimensional struc- 
tures of tiie proteins to be docked (CherfUs et 
1991; Fischer et ai., 1995; Helmer-atterich & 
Tramontano, 1994; Jackson & Sternberg, 1995; Jiang 
& Kim, 1991; Shoichet & Kuntz, 1991; Stoddard & 
Koshland, 1992; Walls & Sternberg, 1992; 
Katchabki-Katzir et d., 1992) are probably more ac- 
curate in the structural detail than that proposed 
here. However, our method has trie clear 
advantage that it can be applied in the absence of 
any structural information, as we have shown here 
for Hsc70, and the prediction of contacts between 
domains could be a useful guide for experimental 
approaches even when structural information is 
not available. 

It remains a major challenge to develop methods 
for detecting molecular partners using only 
sequence information. Correlated mutations may 
be used in this context, scanning data bases of mul- 
tiple-sequence alignments for cases of compatible 
signals, presumably found in interacting proteins. 
For example, should we have a data base where 
each protein is represented by the same number of 
homologous sequences all from the same species, 
then we could in principle inspect the data base 
with a similar multiple alignment of the protein of 
interest and single out those proteins where the 
higher number of positions show a similar partem 
of variation, Le. those that have a higher number 
of correlated positions with respect to the query 
sequence. 

It remains to be seen whether the development 
of such a method is feasible, but its existence 
would be extremely valuable for the various gen- 
ome analysis projects, where complete cellular sys- 
tems are described only by the seouence of their 
components and any procedure able to predict 
their network of interactions could be of enormous 
help. 



M thods 

Selection of a test set of two domain proteins 

We have taken the two-domain proteins 
described by different authors (Holm & Sander, 
1994; Siddiqui & Barton, 1995; Sowdhamini & 
Blundell, 1994; Swindells, 1994). From the initial set 
of 80 protein families, we left out those with less 
than 15 sequences in the HSSP data base (Sander & 
Schneider, 1993). Also discarded were titiose with 
many positions with gaps (positions with more 
than 10% of gaps are not included in the calcu- 
lation of correlated mutations). Homodimers were 
also excluded, since it is impossible to distinguish 
between intra and inter-protein contacts. Our final 
list has 21 examples of two-domain proteins (given 
in Table 1 by their FDB identifiers, Bernstein et d., 
1977). We deliberately avoided manipulating the 
input data: multiple sequence alignments and 
domain definitions were taken directly from public 
sources. In the case of the haemoglobin a and P 
chains we have treated them as if they were a 
single protein with two domains by appending the 
sequences of the p chains to their corresponding a 
chains. Those species for which only one of the 
chains (a or p) is known were not included in the 
alignment. The final grand alignment contains 151 
sequences coming from 147 species. 

Calculation of correlated mutations and 
definition of correlation thresholds 

Correlated mutations were calculated as de- 
scribed (Gdbel et d., 1994). Each position in the 
alignment is coded by a distance matrix. This pos- 
ition-specific matrix contains the distances between 
all pairs of sequences at that position. Distances 
are defined by the scoring matrix of McLachlan 
(1971). The association between each pair of pos- 
itions is calculated as the average of the correlation 
for each corresponding bin of the position-specific 
matrices. Positions with more than 10% gaps or 
completely conserved were not included in the cal- 
culation. 

The exact formula used in our calculation of the 
correlation coefficient (fy) for each pair of positions 
i and / of a protein with N proteins in its alignment 

is: 

_ 1 Wjtf(Saf - <S i >)(S A> - <S y >) 

^""N* a «'°/ 

For each position in the alignment we have an 
N xN matrix where each element (k and i nxnning 
from 1 to N) is the similarity between the two 
residues (k and /) in this position (f) according to 
the given homology matrix. (Sj) is die mean of S^, 
<j t is the standard deviation of S^. 

Given that the accuracy of the predictions of 
contacts directly depends on the correlation values 
(Gdbel et d., 1994), the pairs of positions are sorted 
by their correlati n value and the top M residues 
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are defined as predicted contacts, with M pro- 
portional to the protein size. For this study, the 
number is set to half of the sequence length L, a 
compromise between accuracy of the prediction 
and the possibility of using a statistically signifi- 
cant number of correlated pairs. In practice the L/2 
most correlated pairs of residues are split in three 
classes, domain I- domain I, domain II -domain II 
and domain I -domain EL The values given in 
Table 1 refer to these classes. In two cases no 
values are given in the Table because there were 
not enough pairs of residues among the L/2 best 
correlations. 

Distance calculation and definition of the 
harmonic average (Xd) 

We have previously used ACCURACY (number 
of correctly predicted contacts over total number of 
predicted contacts) to assess the reliability of pre- 
dictions of contacts. ACCURACY is not the best 
measure in the case of domain-domain proximity, 
since we are looking for relative proximity between 
residues rather than for direct physical contact and 
in this case it is more reasonable to use a continu- 
ous measure of proximity. Distances between pairs 
of residues are grouped in bins of 4 A and the dis- 
tribution represented as relative proportions of 
pairs of contacts. Two different distributions of 
binned data are obtained for correlated pairs and 
for all pairs of positions. The difference between 
the two distributions is calculated bin by bin and 
weighted by a factor inversely proportional to the 
normalised distance (in A) of the corresponding 
bin to increase the weight of closer distances. 

Distances between residues correspond to C p -C p 
distances, C* for glycine: 

w *•» 

where, n is the number of distance bins (there are 
15 equally distributed bins from 4 to 60 A); rf* is the 
upper limit for each bin, e.g. 8 for the 4 to 8 bin 
(normalised to 60). P fc is the percentage of corre- 
lated pairs with distance between d t and d { _ x . P ta 
is the same percentage for all pairs of positions. 
Defined in this way Xd = Q indicates no separation 
between the two distance populations, Xd>0 indi- 
cates positive cases where the population of corre- 
lated pairs is shifted to smaller distances with 
respect to the population of all pairs. 

Generation of alternative docking solutions 

To test if correlated positions contain infor- 
mation about protein-protein docking we com- 
pared the distance between correlated pairs of 
residues in the real structure with the distance in 
alternative docking solutions. 

In the first experiment the full docking space 
was searched and a set of 7440 docking solutions 
were generated by rotating the second domain 



with respect to the first in 30° steps; for each orien- 
tation, ten random translations were generated to 
bring the two domains into contact (744 non- 
redundant domain I- domain II relative orien- 
tations and ten random translations for each one of 
them). For the fine-grain search around the real 
docking solution, a large number of alternative 
docking solutions were generated with a docking 
program called ESCHER (Ausiello et a/., 1997). 
Each protein is cut in 15 A thick slices and the ac- 
cessible surface of each slice is described as a poly- 
gon with 15 A sides. The polygons representing 
the first protein are orderly superimposed to the 
polygons representing the second protein and the 
complementarity between them is evaluated. The 
evaluation of the geometric fit between the two 
surfaces depends on the number of sides that can 
be superimposed mamtaining the corresponding 
vertices at a distance lower than a fixed threshold. 

Complementarity is translated into a scoring 
scheme. A complete search in the rotation space 
is exerted by rotating the smaller protein in all 
possible orientations around to the bigger one. 
The cylindrical symmetry inherent to this kind of 
approach is very convenient in order to transform 
a three-dimensional surface matching problem 
into a simplified two-dimensional polygon com- 
parison, but offers a very poor description of the 
target domain poles. In the solutions analysed 
here the target has been described only once 
with the interaction site parallel with the axis 
crossing the domain poles. In two cases (c2c and 
est) different sets of solutions were generated ro- 
tating the target protein 90° around the vertical 
axis. Hie efficiency of our method was similar 
considering one or more sets of solutions (not 
shown). For the purpose of this study the dis- 
tance between the correct solution and alternative 
docking solutions is evaluated as the RMS devi- 
ation of the position of the second protein after 
superimposing the first one. 
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